Add video text to text docs #33164
Conversation
Caused by #31292, will work on it
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
amyeroberts
left a comment
LGTM - thanks for adding! ❤️
zucchini-nlp
left a comment
Yay, thanks for adding this! Looks good, but I was thinking of adding an inference example with pure VideoLLMs, WDYT?
> Now we can preprocess the inputs.
>
> This model has a prompt template that looks like the following. First we'll put all sampled frames into one list. Since we have eight frames in each video, we will insert 12 `<image>` tokens to our prompt. Note that we are adding `assistant` at the end to trigger the model to give answers. We can then preprocess.
Hmm, seems to be a typo, 8 frames per video but a total of 12?
Sorry, I later changed to the 7B model so I should've modified this.
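For context, the step under discussion looks roughly like this. It's only a sketch, assuming the smaller 0.5B checkpoint and dummy PIL frames standing in for the actually sampled video frames; the 6-frames-per-video / 12-token count is illustrative, not the guide's final numbers.

```python
# Rough sketch of the preprocessing step, with dummy frames as placeholders
# for the frames sampled from two videos.
import torch
from PIL import Image
from transformers import LlavaProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-interleave-qwen-0.5b-hf"
processor = LlavaProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder frames: 6 per video, 2 videos -> 12 frames in one list.
frames = [Image.new("RGB", (336, 336)) for _ in range(12)]

# One <image> token per sampled frame, then the question, and `assistant`
# at the end to trigger the model to answer.
user_prompt = "Are these two cats in these two videos doing the same thing?"
image_tokens = "<image>" * len(frames)
prompt = f"<|im_start|>user {image_tokens}\n{user_prompt}<|im_end|><|im_start|>assistant"

inputs = processor(text=prompt, images=frames, return_tensors="pt").to(model.device, model.dtype)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

The key point is that the number of `<image>` tokens in the prompt has to match the total number of frames passed to the processor.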
> - chat fine-tuned models for conversation
> - instruction fine-tuned models
>
> This guide focuses on inference with an instruction-tuned model, [llava-hf/llava-interleave-qwen-7b-hf](https://huggingface.co/llava-hf/llava-interleave-qwen-7b-hf), which can take in interleaved data. Alternatively, you can try [llava-interleave-qwen-0.5b-hf](https://huggingface.co/llava-hf/llava-interleave-qwen-0.5b-hf) if your hardware doesn't allow running a 7B model.
Maybe we could/should add an example with a pure VideoLLM, where we don't have to manually replicate the image token several times and where the model has special treatment for videos, like extra pooling layers.
llava-next-video or video-llava could be an option for that.
I thought most models are coming out as interleaved, so actually using an interleaved example is good since they're harder to get started with. I can add a simple VideoLLM example separately with chat templates though.
Yes, they are mostly interleaved. The difference with llava-interleave is that we didn't add a new model for it, so it's kind of an image LLM used for video. For all the others I am trying to make two separate processors, for images and for videos, with their own special tokens.
Okay, I'll add a video-only one and modify it when you make the processors, does that sound good?
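As a rough idea, a video-only, chat-template-driven example could look like the sketch below, where the chat template inserts the video token so there's no manual `<image>` replication. The LLaVA-NeXT-Video checkpoint and the placeholder frame array are assumptions for illustration, not what necessarily ends up in the guide.

```python
# Minimal sketch of a pure VideoLLM flow with chat templates; `video` would be
# a sampled clip as a (num_frames, height, width, 3) array.
import numpy as np
import torch
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
        ],
    },
]
# The chat template adds the video token for us.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

video = np.zeros((8, 336, 336, 3), dtype=np.uint8)  # placeholder for real sampled frames
inputs = processor(text=prompt, videos=video, return_tensors="pt").to(model.device, model.dtype)
output = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))
```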
@zucchini-nlp re: Slack discussions, I'd say we merge this and edit when the processors are out.
zucchini-nlp
left a comment
Yes, sounds good to me. We'll let users discover how each model expects the inputs from its model card, as there's no single standard yet and we don't natively support video-only LLMs.
Approved, thanks! 💛
Adding video-text-to-text task guide
@zucchini-nlp @NielsRogge @amyeroberts @stevhliu